CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction
Abstract
Lectures #1 and #2 discussed “unstructured data”, where the only information we used about two objects was whether or not they were equal. Last lecture, we started talking about “structured data”. For now, we consider structure expressed as a (dis)similarity measure between pairs of objects. There are many such measures; last lecture we mentioned Jaccard similarity (for sets), L1 and L2 distance (for points in R^k, when coordinates do or do not have meaning, respectively), edit distance (for strings), etc. How can such structure be leveraged to understand the data?

We are currently focusing on the canonical nearest neighbor problem, where the goal is to find the closest point of a point set to a given point (either another point of the point set or a user-supplied query). Last lecture also covered a solution to this problem, the k-d tree. This is a good “first cut” solution when the number of dimensions is not too big — less than logarithmic in the size of the point set. When the number k of dimensions is at most 20 or 25, a k-d tree is likely to work well.

Why do we want the number of dimensions to be small? Because of the curse of dimensionality. Recall that to compute the nearest neighbor of a query q using a k-d tree, one first does a downward traversal through the tree to identify the smallest region of space that contains q. (Nodes of the k-d tree correspond to regions of space, with the region of a node x partitioned between the regions of its children y, z.) Then one does an upward traversal of the tree, checking other cells that could conceivably contain q’s nearest neighbor. The number of cells that have to be checked can scale exponentially with the dimension k. The curse of dimensionality appears to be fundamental to the nearest neighbor problem (and many other geometric problems), and is not an artifact of the specific solution of the k-d tree.
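To make the k-d tree workflow above concrete, here is a minimal sketch (not from the lecture notes) of nearest-neighbor search using scipy.spatial.cKDTree; the point set, query, dimension k, and random seed are made-up illustrations.

```python
# A minimal sketch (not from the lecture notes): nearest-neighbor search with a
# k-d tree via scipy.spatial.cKDTree. Point set, query, dimension k, and seed
# are invented for illustration.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

k = 10                        # dimension; k-d trees tend to work well for k up to ~20-25
n = 100_000                   # size of the point set
points = rng.random((n, k))   # the point set, one row per point

tree = cKDTree(points)        # build the k-d tree (recursive coordinate splits)

q = rng.random(k)             # a user-supplied query point
dist, idx = tree.query(q)     # downward traversal to q's cell, then upward backtracking
print(f"nearest neighbor: point {idx} at L2 distance {dist:.4f}")

# Sanity check against brute force; for large k, brute force often becomes
# competitive because the tree may have to inspect exponentially many cells.
brute_idx = np.argmin(np.linalg.norm(points - q, axis=1))
assert brute_idx == idx
```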
Similar resources
CS168: The Modern Algorithmic Toolbox Lecture #14: Linear and Convex Programming, with Applications to Sparse Recovery
Recall the setup in compressive sensing. There is an unknown signal z ∈ R^n, and we can only glean information about z through linear measurements. We choose m linear measurements a_1, . . . , a_m ∈ R^n. “Nature” then chooses a signal z, and we receive the results b_1 = 〈a_1, z〉, . . . , b_m = 〈a_m, z〉 of our measurements, when applied to z. The goal is then to recover z from b. Last lecture culminated i...
CS168: The Modern Algorithmic Toolbox Lecture #18: Linear and Convex Programming, with Applications to Sparse Recovery
Recall the setup in compressive sensing. There is an unknown signal z ∈ R^n, and we can only glean information about z through linear measurements. We choose m linear measurements a_1, . . . , a_m ∈ R^n. “Nature” then chooses a signal z, and we receive the results b_1 = 〈a_1, z〉, . . . , b_m = 〈a_m, z〉 of our measurements, when applied to z. The goal is then to recover z from b. Last lecture culminated i...
CS168: The Modern Algorithmic Toolbox Lecture #5: Sampling and Estimation
This week, we will cover tools for making inferences based on random samples drawn from some distribution of interest (e.g. a distribution over voter priorities, customer behavior, IP addresses, etc.). We will also learn how to use sampling techniques to solve hard problems — both problems that inherently involve randomness, as well as those that do not. As a warmup, to get into the probabilisti...
CS168: The Modern Algorithmic Toolbox Lecture #13: Sampling and Estimation
This week, we will cover tools for making inferences based on random samples drawn from some distribution of interest (e.g. a distribution over voter priorities, customer behavior, IP addresses, etc.). We will also learn how to use sampling techniques to solve hard problems — both problems that inherently involve randomness, as well as those that do not. As a warmup, to get into the probabilisti...
CS168: The Modern Algorithmic Toolbox Lecture #6: Markov Chain Monte Carlo
The previous lecture covered several tools for inferring properties of the distribution that underlies a random sample. In this lecture we will see how to design distributions and sampling schemes that will allow us to solve problems we care about. In some instances, the goal will be to understand an existing random process, and in other instances, the problem we hope to solve has no intrinsic ...
Publication year: 2015